Goto

Collaborating Authors

 training sample


Skew-adaptive conformal prediction

arXiv.org Machine Learning

We develop a skew-adaptive extension of split conformal prediction for regression. The method starts from an asymmetric interval family centered at a point prediction and uses the gauge approach to deduce the conformity score induced by this family. The inverse hyperbolic sine transform of signed scaled residuals provides the training target for an additional predictive model, whose role is to learn how predictive uncertainty should tilt across the feature space. The resulting procedure preserves the finite-sample marginal validity of split conformal prediction under exchangeability, while producing intervals that adapt to both local scale and local skewness. We also develop a calibration-sample-based estimator for comparing the expected relative future width of the skew-adaptive and classical scaled-score intervals. Experiments on a variety of datasets indicate gains in prediction interval efficiency over the scaled-score construction and conformalized quantile regression, and show that the proposed estimator closely matches the corresponding average width ratio observed on the test sample.


Training-Free Generative Sampling via Moment-Matched Score Smoothing

arXiv.org Machine Learning

Diffusion models generate samples by denoising along the score of a perturbed target distribution. In practice, one trains a neural diffusion model, which is computationally expensive. Recent work suggests that score matching implicitly smooths the empirical score, and that this smoothing bias promotes generalization by capturing low-dimensional data geometry. We propose moment-matched score-smoothed overdamped Langevin dynamics (MM-SOLD), a training-free interacting particle sampler that enforces the target moments throughout the sampling trajectory. We prove that, in the large-particle limit, the empirical particle density converges to a deterministic limit whose one-particle stationary marginal is a Gibbs--Boltzmann density obtained by exponentially tilting a naive score-smoothed diffusion target. The mean and covariance of this distribution agree with the empirical moments of the training data. Experiments on 2D distributions and latent-space image generation show that MM-SOLD enables fast, robust, training-free sampling on CPUs, with sample fidelity and diversity competitive with neural diffusion baselines.


The two clocks and the innovation window: When and how generative models learn rules

arXiv.org Machine Learning

Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $τ_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $τ_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $τ_{\mathrm{rule}}$ and $τ_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $τ_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $τ_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \emph{innovation window} as the interval $[τ_{\mathrm{rule}}, τ_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $τ_{\mathrm{rule}} \geq τ_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $τ_{\mathrm{rule}}$, while training samples' basins begin to dominate around $τ_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.


Imbalanced Classification under Capacity Constraints

arXiv.org Machine Learning

In many classification settings, the class of primary interest is underrepresented, leading to imbalanced data problems that arise in applications such as rare disease detection and fraud identification. In these contexts, identifying a potential positive instance typically triggers costly follow-up actions, such as medical imaging or detailed transaction inspection, which are subject to limited operational capacity. Motivated by this setting, we consider classification problems where data may arrive sequentially and decisions must be made under constraints on the number of instances that can be selected for further analysis. We propose a classification framework that explicitly controls the rate of positive predictions, enforcing a user-defined bound on the proportion of observations classified as belonging to the minority class while maximizing detection performance. The approach can be implemented using standard learning methods and naturally extends to online settings, where decisions are taken in real time. We show that incorporating capacity constraints leads to substantial improvements over classical approaches, including resampling techniques such as SMOTE, which do not directly control the selection rate.


Static and Sequential Malicious Attacks in the Context of Selective Forgetting

Neural Information Processing Systems

With the growing demand for the right to be forgotten, there is an increasing need for machine learning models to forget sensitive data and its impact. To address this, the paradigm of selective forgetting (a.k.a machine unlearning) has been extensively studied, which aims to remove the impact of requested data from a well-trained model without retraining from scratch. Despite its significant success, limited attention has been given to the security vulnerabilities of the unlearning system concerning malicious data update requests. Motivated by this, in this paper, we explore the possibility and feasibility of malicious data update requests during the unlearning process. Specifically, we first propose a new class of malicious selective forgetting attacks, which involves a static scenario where all the malicious data update requests are provided by the adversary at once. Additionally, considering the sequential setting where the data update requests arrive sequentially, we also design a novel framework for sequential forgetting attacks, which is formulated as a stochastic optimal control problem. We also propose novel optimization algorithms that can find the effective malicious data update requests. We perform theoretical analyses for the proposed selective forgetting attacks, and extensive experimental results validate the effectiveness of our proposed selective forgetting attacks. The source code is available in the supplementary material.


Tail allocation for conformal prediction intervals

arXiv.org Machine Learning

We study split-conformal prediction for regression when the reported prediction set must be a single interval, at target marginal coverage $1-α$, where $α$ is the nominal miscoverage level. Under this reporting constraint, the natural conditional target is the shortest interval with conditional mass at least $1-α$, rather than an equal-tailed interval or a possibly disconnected high-probability set. We parameterize this single-interval oracle by a lower-tail allocation, which determines how the nominal miscoverage $α$ is split between the two endpoints, and propose tail-allocation conformalized quantile regression (TA-CQR). TA-CQR estimates this allocation by searching over quantile-defined cores and then applies nonnegative additive split-conformal calibration, retaining exact finite-sample marginal coverage under exchangeability. The main contribution is theoretical. We characterize the oracle geometry, including its highest-density interpretation under unimodality and the positive connectedness cost induced by disconnected highest-density sets. We prove local recovery of the selected allocation and core, establish that calibration radii are asymptotically negligible under endpoint-density conditions, and give a finite-sample calibrated length oracle inequality with explicit grid, endpoint-quantile estimation, and calibration-sampling terms. Simulations and real-data examples report coverage and length jointly.



4c4c937b67cc8d785cea1e42ccea185c-Supplemental.pdf

Neural Information Processing Systems

Proof of Proposition 1. Due to Jensen's inequality and the fact that, by assumption, the distribution of human predictions P(h|x) is not a point-mass, it holds that Eh[`(h(x),y) |x] > `(µh(x),y). Proof of Theorem 3. We first provide the proof of the unconstrained case. Note that the above problem is a linear program and it decouples with respect to x. Therefore, for each x, the optimal solution is clearly given by: π m(d= 1 |x) = 1 if Ey|x[`(m(x),y) Eh|x[`(h,y)]] >0 0 otherwise Next, we provide the proof of the constrained case. To this aim, we consider the dual formulation of the optimization problem, where we only introduce a Lagrangian multiplier τP,b for the first constraint, i.e., maximize Ex π(x) Ey,h|x[`(h,y)] Ey|x[`(m(x),y)] + Ex [τP,b(π(x) b)] (13) subject to 0 π(x) 1 x X. (14) 13 The inner minimization problem can be solved using the similar argument for the unconstrained case.


Revisit the Power of Vanilla Knowledge Distillation: from Small Scale to Large Scale Supplementary Material

Neural Information Processing Systems

A.1 Details of "stronger recipe" In Table 1 of our main paper, we evaluate the impact of limited model capacity [1] and small-scale dataset by comparing the results of using "previous training recipe" and our "stronger recipe". We summarize the details of "stronger recipe" and present them in Table 13. Table 13: Stronger training strategy used for distillation. "B" and "C" represent strategies for training students on ImageNet-1K and CIFAR100, respectively. A.2 Numerical results In Figure 1 of our main paper, we present a comparison of performance gaps among vanilla KD and two logits-based baselines, i.e., DKD [2] and DIST [3], on two datasets of varying scales, to demonstrate the underestimation of vanilla KD on small-scale datasets.


On Learning Latent Models with Multi-Instance Weak Supervision

Neural Information Processing Systems

We consider a weakly supervised learning scenario where the supervision signal is generated by a transition function σ of labels associated with multiple input instances. We formulate this problem as multi-instance Partial Label Learning (multi-instance PLL). Our problem is an extension to the standard PLL problem and is met in different fields, including latent structural learning and neuro-symbolic integration. Despite the existence of many learning techniques, limited theoretical analysis has been dedicated to this problem. In this paper, we provide the first theoretical study of multi-instance PLL with possibly an unknown transition σ.